This research paper presents an innovative solution for offline handwritten word recognition in
Bengali, a prominent Indic language. The complexities of this script, particularly in cursive
writing, often lead to overlapping characters and segmentation challenges. Conventional
methodologies, which rely on recognizing individual characters and aggregating the results, are
error-prone. To overcome these limitations, we propose a novel method that treats the entire
document as a coherent entity and uses the efficient you only look once (YOLO) model for word
extraction. In our approach, we view individual words as distinct objects and employ the YOLO
model for supervised learning, transforming object detection into a regression problem that
predicts spatially separated bounding boxes and class probabilities. Rigorous training yields a
box loss of 0.014, an obj loss of 0.14, and a class loss of 0.009. Furthermore, the achieved
mAP 0.5 score of 0.95 and mAP 0.5:0.95 score of 0.97 demonstrate the model's accuracy in
detecting and recognizing handwritten words. To evaluate our method comprehensively, we
introduce the Omor-Ekush dataset, a meticulously curated collection of 21,300 handwritten words
from 150 participants, featuring 141 words per document. Our YOLO-based approach, combined with
the curated Omor-Ekush dataset, represents a significant advancement in handwritten word
recognition in Bengali.
1. INTRODUCTION

Bengali is the primary language of communication for around 300 million people worldwide. It is
the official language of Bangladesh and one of the officially recognized languages of the
Republic of India. Its script is among the most extensively used writing systems in the world,
used by over 265 million people [1]. Word recognition is a technique that enables computers to
identify written or printed words and convert them into a machine-readable format. Bangla
handwritten word recognition is one of the most fascinating areas of current research, and it
is becoming increasingly popular. Even in the twenty-first century, handwritten communication
has its place and is used in daily life as a way of capturing information that is intended to
be shared with others. The need for online information systems has grown along with the
expansion of the internet.

Bangla handwritten word recognition research encompasses a wide range of applications, making
it a vital field of study. These applications include document digitization, post offices,
banks, document analysis and recognition, education, the preservation of historical documents,
and signature verification [1]. It also plays a crucial role in enhancing accessibility tools
and streamlining e-commerce processes. Furthermore, the integration of handwriting recognition
in personal assistants, note-taking apps, and search engines improves user experience and
caters to the needs of millions of Bengali speakers worldwide. Word detection in Bangla
handwritten text is challenging due to complex alphabet shapes, variations caused by cursive
writing, uneven lighting, and image distortions. Traditional techniques struggle with
ambiguity, noise, and lack of standardization [2]. Innovative approaches and advanced models
are therefore necessary to improve word recognition for efficient document processing and for
applications in education and communication.

To overcome the limitations in Bengali handwritten word recognition, we propose an approach
that treats the entire document as a coherent entity. By leveraging the efficiency of the you
only look once (YOLO) model, we extract words by viewing them as distinct objects. Through
supervised learning, we transform object recognition into a regression problem, enabling us to
predict spatially separated bounding boxes and class probabilities. Rigorous training of our
YOLO model results in precise and efficient word recognition in Bengali sentences.
Additionally, we curate a comprehensive and unique Bengali dataset named Omor Ekush, which
contains complex words, and make it openly available for future research. The main objective of
this paper is to extract words from scanned images containing Bengali handwriting.

Figure 1(a) illustrates a sample of comparatively good handwriting in Bangla. The writing is clear,
well-formed, and easily legible. In contrast, Figure 1(b) displays an example of comparatively cursive
handwriting in Bangla, where characters are conjoined and the writing style is more fluid. Cursive handwriting
poses challenges in word recognition due to overlapping and tilted characters, which are difficult for
conventional recognition techniques to handle.


Figure 1. Handwriting comparison; (a) comparatively good handwriting and (b) comparatively cursive
handwriting


Figure 2 showcases the process of word recognition from handwritten Bangla text. The demo 
illustrates how the proposed method effectively identifies and extracts individual words from the handwritten 
text. By treating the entire document as a coherent entity and using the YOLO model, the system accurately 
predicts spatially separated bounding boxes and class probabilities for each word. Rigorous training ensures 
outstanding performance, demonstrating the model’s exceptional accuracy in recognizing and extracting words 
from handwritten Bangla text. 


Figure 2. Word recognition from handwriting 
2. RELATED WORKS

In this section, we provide an overview of the existing studies in the field of Bangla handwritten
word recognition. Previous research efforts in this area have primarily focused on developing methodologies
and algorithms capable of accurately identifying and interpreting complete words written in the Bengali script.
This task is particularly challenging due to the cursive nature of the handwriting, which often leads to
significant variations in character shapes and connectivity within a word. By examining the completed work on
Bangla handwritten word recognition, we gain insights into the progress made in this domain and the potential
areas for further advancement. As we review the literature, we aim to highlight the gaps and opportunities
that exist for developing more robust and sophisticated word recognition systems in the context of Bengali
handwriting.

Roy et al. [3] made the initial breakthrough in Bengali word recognition while working on Indian
postal automation. They concentrated on recognizing Bengali words drawn from exactly 76 city names
and employed the 2D non-symmetric half-plane hidden Markov model (NSHP-HMM) method to accomplish this
task. A more comprehensive analysis of their work can be found in [4].

Pal et al. [5] proposed a model for unconstrained handwritten town name recognition that uses a
lexicon-driven approach. They implemented the water-reservoir technique to segment city word images
into primitives and then used dynamic programming to obtain the optimal character segmentation. Finally,
they adopted a feature-based modified quadratic discriminant function (MQDF) classifier to compute the
likelihood of each character.

Bhowmik et al. [6] utilized an HMM to identify handwritten city names with the aid of a fixed-size
lexicon. The HMM was trained using genetic algorithms. Their approach included a structural feature that
employed a directional encoding scheme on boundaries. In a subsequent study [7], the same group made further
improvements by reducing the lexicon size, effectively narrowing down the search space. This
reduction was achieved through an analysis of the vertical and horizontal strokes of the word image. These
advancements aimed to enhance the accuracy and efficiency of handwritten town name recognition.

Roy et al. [8] proposed an HMM-based Bengali handwritten word recognizer. In their recognition
module, they employed zone-wise horizontal segmentation followed by vertical segmentation in the middle zone.
They used the local gradient histogram as a feature set and fed it to a left-to-right HMM for recognizing the
middle zone. A gradient feature-based support vector machine (SVM) classifier was employed to recognize the
upper and lower zone modifiers. They then combined the zone-wise recognition results via a character
alignment approach.

Bhunia et al. [9] utilized five types of features for middle zone recognition and compared their
approach with their earlier work in [10]. They further combined their findings from [8], [9] to develop a unified
approach for Devanagari and Bengali word recognition in [11]. The experimental dataset used across
these studies included various word images, incorporating numerous city names. Their unified approach
contributes to the advancement of word recognition in Bengali script and provides valuable
insights for similar applications in other scripts.

Adak et al. [12] introduced a neural network (NN) based system for Bengali handwritten word
recognition, using a convolutional neural network (CNN) with long short-term memory (LSTM). They curated a
novel dataset called NewISIdb and employed a unified architecture that combines CNNs with a recurrent model.
LSTM blocks and a connectionist temporal classification (CTC) layer further enhanced the recognition accuracy,
showcasing the potential of neural nets in this domain. The study represents a significant advancement in
Bengali handwritten word recognition research.

In our research, we employ a different solution for handwritten word detection in Bengali. Our
method uses the YOLO model to process the entire document as a coherent entity, predicting bounding
boxes and class probabilities for word extraction. The curated Omor-Ekush dataset complements this
approach by providing a diverse benchmark for word recognition models. This YOLO-based method
advances handwritten word recognition in Bengali, with potential applications in document analysis,
automated transcription, and language processing.

Table 1 provides a comprehensive summary of research endeavors in the field of handwritten Bengali 
word recognition, highlighting the approaches employed and the research gaps addressed by each study. The 
table offers valuable insights into the contributions of each study and the specific areas in the literature they 
have targeted for improvement. 
Table 1. Overview of handwritten Bengali word recognition studies and research gaps

| Study | Focus | Approach | Research limitations |
|---|---|---|---|
| Roy et al. [3] | Bengali word recognition | 2D NSHP-HMM technique for recognizing 76 city names | Limited research on recognizing specific Bengali words, especially city names |
| Pal et al. [5] | Handwritten city name recognition | Lexicon-driven approach, water-reservoir technique, and MQDF classifier | Scarcity of methods for recognizing unconstrained handwritten city names in Bengali script |
| Bhowmik et al. [6] | Handwritten town name recognition | HMM with genetic algorithms, structural feature | Need for efficient techniques to identify handwritten city names, particularly with lexicon of fixed size |
| Roy et al. [8] | HMM-based Bengali handwritten word recognizer | Region-wise horizontal and vertical segmentation, HMM, SVM classifier, character alignment strategy | Advancements in zone-wise recognition for Bengali handwritten words |
| Bhunia et al. [9] | Middle zone recognition | Comparison of features, unified approach for Bengali and Devanagari word recognition | Enhancements in recognizing middle zones of Bengali handwritten words |
| Adak et al. [12] | Bengali handwritten word recognition | CNNs with LSTM, YOLO model, NewISIdb dataset, CTC layer | Utilization of CNNs with LSTM for Bengali handwritten word recognition, the introduction of the NewISIdb dataset |
| Proposed method | Offline handwritten word recognition in Bengali | YOLO model to treat entire documents, predict bounding boxes, and class probabilities for word extraction | Addressing complexities in Bengali cursive writing, introducing a novel approach using the YOLO model and Omor-Ekush dataset for offline handwritten word recognition |


3. BACKGROUND STUDY 

Recent work demonstrates that artificial neural networks are well suited to image detection
tasks. In the 1950s, researchers first developed concepts such as the perceptron
learning algorithm [13]. Recent NNs base their operations on theories developed during the perceptron period.
This section begins by defining a neuron as a crucial component of contemporary NNs. It then goes into further
detail on CNNs and recurrent neural networks (RNNs).


3.1. Perceptron 

The perceptron, an artificial neuron, forms the fundamental building block of modern neural networks
and is pivotal in machine learning. A perceptron consists of a single neuron. The activation $a$, which is the
output of the neuron, is given by the mapping $\phi: \mathbb{R}^N \to \mathbb{R}$ as in (1):

$$a = \phi(x) = \sigma(w^{T}x + b) \tag{1}$$

Here $w \in \mathbb{R}^N$ are the weights, $b \in \mathbb{R}$ is the bias, and $\sigma(\cdot)$ is a nonlinear
function applied to the weighted feature vector $x \in \mathbb{R}^N$. In the case of the perceptron,
$\sigma(\cdot)$ stands for:

$$\sigma(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{otherwise} \end{cases} \tag{2}$$
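
The perceptron of (1) and (2) can be illustrated with a minimal Python sketch. The function names and the example weights (chosen so that the unit behaves like a logical AND) are assumptions made only for illustration and are not part of the proposed method.

```python
from typing import Sequence


def step(z: float) -> int:
    """Step activation from (2): 1 if z > 0, otherwise 0."""
    return 1 if z > 0 else 0


def perceptron(x: Sequence[float], w: Sequence[float], b: float) -> int:
    """Compute a = sigma(w^T x + b) as in (1)."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return step(z)


# Hypothetical weights and bias: the unit fires only when both inputs are 1.
AND_WEIGHTS = (1.0, 1.0)
AND_BIAS = -1.5
```

With these assumed values, only the input (1, 1) gives a positive weighted sum, so only that input is mapped to 1.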


3.2. Convolutional neural networks 

Convolutional networks have made a significant impact on the field of image analysis
and have been the basis for many recent advances in deep learning [14]. A CNN is a NN that processes an image
and represents it as a vector code. Architecturally, CNNs build upon fully connected neural networks:
a convolutional network likewise comprises multiple layers that process signals and propagate
them forward. However, unlike the vector activations found in a fully connected layer, CNN activations have the
structure of three-dimensional tensors. This output tensor, typically termed a "feature map," plays a crucial
role in a CNN. For example, the first convolutional layer takes an input image with
dimensions H x W x 3 and produces a feature map with dimensions H' x W' x C, where C represents the
number of features extracted by the layer. Essentially, a convolutional layer converts one volume into another.
A standard CNN is composed of multiple convolutional stages, with dense layers at the top that transform the
final convolutional volume into a vector output. The vector representation of an
image is commonly referred to as fc7 features, because it was originally taken from the seventh layer (a fully
connected layer) of the AlexNet architecture [15]. Although many newer architectures
have surpassed the performance of AlexNet, and current state-of-the-art designs differ from it, the term "fc7
features" has remained popular within the field. Moreover, one can add an extra layer, such as a softmax layer,
on top of the fc7 features, depending on the particular problem that the network aims to address. A typical CNN
architecture is illustrated in Figure 3.
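
How a convolutional layer converts one volume into another can be made concrete with the standard output-size formula, floor((size + 2 x padding - kernel) / stride) + 1. The following Python sketch is illustrative only; the kernel size, stride, padding, and number of filters used in the example are assumptions and do not describe any specific layer of the model used in this paper.

```python
from typing import Tuple


def conv_output_size(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_feature_map_shape(
    height: int,
    width: int,
    kernel: int,
    filters: int,
    stride: int = 1,
    padding: int = 0,
) -> Tuple[int, int, int]:
    """Shape (H', W', C) of the feature map; C equals the number of filters."""
    out_h = conv_output_size(height, kernel, stride, padding)
    out_w = conv_output_size(width, kernel, stride, padding)
    return (out_h, out_w, filters)
```

For instance, under these assumptions a 640 x 640 input passed through 32 filters of size 3 x 3 with stride 2 and padding 1 gives a 320 x 320 x 32 feature map.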
Figure 3. Layer with CNN


3.3. Pooling layer 

The design of convolutional layers allows for the preservation of spatial dimensions while increasing
the depth of the network as information flows through it. However, it is often desirable to reduce the spatial
dimensions, especially in deeper layers. Dimensionality reduction can be achieved by applying a stride when
convolving, which reduces the overlap between neighboring receptive fields. A more straightforward technique,
however, is the pooling layer. The input is divided into non-overlapping regions, and the layer produces
a grid containing the maximum value from each region. Pooling layers are commonly placed between
convolutional layers in order to decrease the dimensionality of the data.
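
The pooling operation described above can be sketched in a few lines of Python. This is a minimal illustration on a single-channel grid, assuming that the grid dimensions are divisible by the region size; the function name is hypothetical.

```python
from typing import List


def max_pool2d(grid: List[List[float]], size: int = 2) -> List[List[float]]:
    """Split the grid into non-overlapping size x size regions and keep each maximum."""
    rows, cols = len(grid), len(grid[0])
    if rows % size or cols % size:
        raise ValueError("grid dimensions must be divisible by the region size")
    return [
        [
            max(grid[r + dr][c + dc] for dr in range(size) for dc in range(size))
            for c in range(0, cols, size)
        ]
        for r in range(0, rows, size)
    ]


example = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [3, 2, 1, 8],
]
pooled = max_pool2d(example)  # [[4, 5], [3, 8]]
```

A 4 x 4 input is thus reduced to a 2 x 2 output, halving each spatial dimension.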


3.4. Weights 

In a NN, each neuron produces an output by applying a specific function to the inputs it receives from
the receptive field of the preceding layer. A weight vector and a bias determine the function that is applied
to the input data. Learning consists of iteratively fine-tuning these biases and weights. Filters are vectors
of weights and biases that represent specific features of the input. A unique characteristic of CNNs is the
ability for multiple neurons to use the same filter. The use of a shared filter, with a single bias and weight
vector, across multiple receptive fields reduces the memory footprint of the network, compared to having a
separate bias and weight vector for each receptive field.
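
The memory saving from weight sharing can be quantified with a simple parameter count. The sketch below is illustrative; the layer sizes in the example are assumptions, and the "unshared" case is a hypothetical layer that stores a separate weight vector and bias for every receptive field.

```python
def shared_filter_params(kernel: int, in_channels: int, filters: int) -> int:
    """Parameters of a convolutional layer: one weight vector and one bias per filter."""
    return filters * (kernel * kernel * in_channels + 1)


def unshared_params(out_h: int, out_w: int, kernel: int, in_channels: int, filters: int) -> int:
    """Parameters if every receptive field had its own weight vector and bias."""
    return out_h * out_w * shared_filter_params(kernel, in_channels, filters)


# Assumed example: 16 filters of size 3 x 3 over a 3-channel input, 32 x 32 output positions.
shared = shared_filter_params(kernel=3, in_channels=3, filters=16)
unshared = unshared_params(out_h=32, out_w=32, kernel=3, in_channels=3, filters=16)
```

In this assumed configuration the shared layer needs 448 parameters, whereas the unshared variant would need 458,752.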


4. DATASETS 
4.1. Previous datasets 

Handwriting recognition has been a captivating research issue for almost fifty years. Early
successes were achieved in recognizing simple handwritten digits. In 1992, the First Census Optical Character
Recognition System Conference organized the pioneering large-scale character recognition challenge [16].
Following this, researchers gradually began to develop datasets for offline handwriting recognition at the
sentence and document levels for the English language [17], [18]. The IAM
dataset was later used to launch one of the most well-known shared tasks in handwriting recognition, the
International Conference on Document Analysis and Recognition (ICDAR) challenge [19]. Our research
revealed only a limited number of Bangla handwriting datasets, with the majority consisting of character-level
data. BanglaLekha-Isolated [20] is an example of such a dataset. It comprises 10 numerals, 50 simple
characters, and 24 carefully chosen compound characters. The dataset includes 2,000 individual images for
each of the 84 character classes. After removing any scribbles, the final dataset contains a total of
166,105 images of handwritten characters. The dataset additionally incorporates information regarding the age and
gender of the individuals who contributed the handwriting samples.

Rabby et al. [21] present another versatile handwritten character dataset, which
contains 367,018 characters. The data was gathered from various regions of Bangladesh, with an equal
number of female and male contributors from a range of age groups. In addition to the character data, the
dataset also includes a set of modifiers, a feature not found in other similar character-level datasets.
The ISI [22] and CMATERdb [23] datasets represent two of the earliest collections of
handwritten characters for the Bangla language. The BanglaWriting [10] dataset is the sole collection that
bears similarity to our dataset with respect to word-level annotation. The dataset encompasses handwriting
samples from 260 individuals of varying ages and personalities. The creators used an annotation tool to
label each page with bounding boxes that enclose the Unicode representation of the words. This collection
encompasses 32,787 characters and 21,234 words, with a vocabulary size of 5,470.
Although all word label bounding boxes were created manually, the actual ground truth of the
pages from which the texts were generated was not provided. Furthermore, the majority of the pages were brief
and could be perceived as resembling a paragraph rather than a complete document.
4.2. Dataset preparation

This article introduces a Bangla handwriting dataset called Omor Ekush, which includes single-page
handwriting samples from 150 individuals of varying ages and personalities. Every page contains bounding boxes
that each enclose a word, together with the Unicode representation of the text. This collection encompasses 21,300
words. All bounding boxes were initially created using our custom segmentation script, then manually verified
and labeled. The dataset is suitable for advanced optical character/word recognition, writer
identification, handwritten word segmentation, and word generation.


4.3. Dataset description 

Omor Ekush, the dataset introduced in this article, strives to offer a superior handwriting collection.
The dataset can be used for a range of applications based on machine learning and
deep learning techniques. It is applicable to writer biometric tasks, including identification and
verification. Moreover, it shows promise in specific computer vision applications, including
optical character recognition and handwriting segmentation. In addition, the dataset has the potential to
power generative handwriting models. The construction and utilization of this dataset differ from typical Bangla
datasets [20]. Current datasets for Bangla script only include individual character examples. In contrast, the
Omor Ekush dataset includes word-based script with bounding boxes, similar to [10]. The dataset was built
using established offline handwriting and writer recognition datasets as a foundation [18]. The majority of
larger datasets, such as KHATT [24] and IAM [18], incorporate automated and pre-determined parameters for
data labeling. In contrast, the annotations and labels of the Omor Ekush dataset were manually created.


4.3. Dataset generation 
4.3.1. Data acquisition 

We want our dataset to cover all the characters in the Bengali language, so we carefully chose two
Bengali pangrams from Wikipedia. Figure 4 depicts the selected pangrams. We removed the line breaks in the
verse pangram to save page space, as the rhythm is not important for the dataset. Combining both pangrams,
a single document contains approximately 141 words, but due to some word duplication the number of unique
words is 137. Figure 5 shows a sample handwritten page collected from a writer and used to train our model.




Figure 4. Selected Bengali pangrams 




Figure 5. Sample image data taken from writer 
The dataset was gathered from students at Pabna University of Science and Technology. All
participants, both male and female, were between 20 and 25 years old. The authors used A4-sized paper and a
consistent ball-point pen for writing. Participants were instructed to write a pre-selected text. Thus, each
page comprises approximately the same number of words.

From Figure 6 we can see that five of the classes have a higher frequency than the rest of the classes.
This is because the two pangrams share some common words. Table 2 lists the common words of each
pangram.


Figure 6. Words (classes) distribution of the dataset (x-axis: classes)


Table 2. Common words of each pangram

| Label | Class 1 | Class 2 |
|---|---|---|
| E | 74 | 57 |
| Ej | 60 | 76 |
| En | 43 | 89 |
| KG | 42 | 114 |
| SCS | 119 | 124 |


4.3.2. Data extraction 

The handwritten pages were digitized using smartphone cameras with 1280p resolution. The dataset
encompasses a total of 150 images. These images were processed using the
CamScanner app to remove unwanted noise such as lighting effects,
flashlight glare, and shadows. Some noise nevertheless remains in a few images.


4.3.3. Data processing and labeling 

Processing and labeling a huge amount of data is not a trivial task. Initially, our approach was to extract
the bounding region of each word based on the space between the words in a sentence, with the help of the OpenCV
findContours method (which finds the contours of the connected components of an image). However, because of the
cursive handwriting style, words often run into each other, so a fully automated script was not a viable option
and the labeling had to be done manually. With manual annotation software such as LabelMe [25], every word's
bounding box has to be drawn by hand, even though bounding boxes for some words can be generated automatically
by exploiting the spacing. We therefore created a new annotation tool, handwritten automated word annotation (HAWAN).
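
The spacing heuristic discussed above, and the reason it breaks down on cursive writing, can be illustrated with a small Python sketch. It is not the segmentation script used to build the dataset. It assumes that connected-component boxes on a single text line have already been extracted (for example by contour detection), and the function name, box format, and gap threshold are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max)


def merge_by_gap(components: List[Box], max_gap: int) -> List[Box]:
    """Merge component boxes on one text line whose horizontal gap is at most max_gap."""
    words: List[Box] = []
    for box in sorted(components):  # left to right by x_min
        if words and box[0] - words[-1][2] <= max_gap:
            last = words[-1]
            words[-1] = (
                last[0],
                min(last[1], box[1]),
                max(last[2], box[2]),
                max(last[3], box[3]),
            )
        else:
            words.append(box)
    return words


# Well-spaced writing: four components form two words.
spaced = merge_by_gap([(0, 0, 10, 10), (12, 1, 20, 9), (40, 0, 55, 10), (57, 2, 60, 8)], max_gap=5)
# Cursive writing: two different words nearly touch and are wrongly merged into one box.
cursive = merge_by_gap([(0, 0, 10, 10), (11, 0, 30, 10)], max_gap=5)
```

The second call shows the failure mode: when adjacent words are closer than the gap threshold, the heuristic returns a single box, which is why manual correction is still needed.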


5. HANDWRITTEN AUTOMATED WORD ANNOTATION 

HAWAN is a web-based tool that eases handwritten word annotation, including bounding box
generation and labeling. It exports the generated annotation information in JSON format.

Figure 7 shows the initial prediction of the word bounding boxes based on the OpenCV findContours
method. The initial prediction contains many unwanted bounding boxes. HAWAN provides actions
to remove these unwanted bounding boxes. It also provides drag-and-drop word labeling to ease the
annotation task.
Figure 7. HAWAN


ISSN: 2302-9285


Figure 8 shows properly annotated words. Once a bounding box has been annotated, its color changes
from red to green for visual inspection. Once all the bounding boxes are properly annotated, the save
action generates a JSON file with all the bounding box information and the filename. The bounding
box and label data for each image were stored in a separate JSON file. The naming convention followed
that of the handwritten images. Figure 9 displays the parameters of the standard JSON file. To maintain the
authenticity and quality of the dataset, we did not apply any augmentation to increase its size.
Figure 8. Properly annotated words


Figure 9. Annotation information in JSON format 
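
To show how such per-image annotation files can feed a YOLO training pipeline, the following Python sketch converts a JSON annotation into YOLO label lines (class index followed by the normalized box center, width, and height). The JSON field names used here (`filename`, `width`, `height`, `boxes`, `label`, `x`, `y`, `w`, `h`) are assumptions made for illustration; the actual HAWAN schema is the one shown in Figure 9 and may differ.

```python
import json
from typing import Dict, List

# Hypothetical annotation record; pixel coordinates with (x, y) as the top-left corner.
SAMPLE_ANNOTATION = json.dumps({
    "filename": "page_001.jpg",
    "width": 1000,
    "height": 2000,
    "boxes": [{"label": "word_a", "x": 100, "y": 200, "w": 200, "h": 100}],
})


def to_yolo_lines(annotation_json: str, class_ids: Dict[str, int]) -> List[str]:
    """Convert pixel boxes to 'class x_center y_center width height' (all normalized)."""
    data = json.loads(annotation_json)
    img_w, img_h = data["width"], data["height"]
    lines = []
    for box in data["boxes"]:
        x_center = (box["x"] + box["w"] / 2) / img_w
        y_center = (box["y"] + box["h"] / 2) / img_h
        width = box["w"] / img_w
        height = box["h"] / img_h
        class_id = class_ids[box["label"]]
        lines.append(f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}")
    return lines


yolo_lines = to_yolo_lines(SAMPLE_ANNOTATION, {"word_a": 0})
```

Each returned line corresponds to one word box, so one text file per image can be written next to the image for training.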
6. YOLO V5 MODEL

YOLO is refreshingly simple. A single convolutional network simultaneously predicts multiple
bounding boxes and their corresponding class probabilities. YOLO trains on complete images and
directly optimizes detection performance. This unified model offers several advantages over conventional
approaches to object detection. Like other single-stage object detectors, YOLO v5
consists of three crucial components: backbone, neck, and head. Figure 10
shows the simplified architecture of YOLO v5.


Figure 10. YOLO v5 simplified architecture (panels: BackBone, PANet, Output)


In YOLO v5, the Leaky ReLU activation function is employed in the hidden layers, while the sigmoid
activation function is used in the final detection layer. Stochastic gradient descent (SGD) is the default
optimizer for YOLO v5. YOLO v5 computes the loss as a weighted sum of three distinct losses:

a. The class loss, which uses binary cross entropy (BCE);

b. The objectness loss, which also uses BCE;

c. The location loss, which is based on intersection over union (IoU).

The IoU loss is calculated from the overlap between the predicted box and the ground truth box.
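
The structure of this weighted sum can be sketched for a single prediction as follows. This is a simplified illustration and not the YOLO v5 implementation: the location term is taken as 1 - IoU, the class loss is averaged over classes, and the weights `w_box`, `w_obj`, and `w_cls` are placeholder assumptions.

```python
import math
from typing import Sequence


def bce(pred: float, target: float, eps: float = 1e-7) -> float:
    """Binary cross entropy for one predicted probability."""
    pred = min(max(pred, eps), 1.0 - eps)  # avoid log(0)
    return -(target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred))


def iou_loss(iou_value: float) -> float:
    """Location loss sketch: a perfect overlap (IoU = 1) gives zero loss."""
    return 1.0 - iou_value


def total_loss(
    iou_value: float,
    obj_pred: float,
    obj_target: float,
    cls_preds: Sequence[float],
    cls_targets: Sequence[float],
    w_box: float = 1.0,
    w_obj: float = 1.0,
    w_cls: float = 1.0,
) -> float:
    """Weighted sum of the location, objectness, and class losses."""
    cls = sum(bce(p, t) for p, t in zip(cls_preds, cls_targets)) / len(cls_preds)
    return w_box * iou_loss(iou_value) + w_obj * bce(obj_pred, obj_target) + w_cls * cls
```

Raising one of the weights makes the optimizer pay more attention to the corresponding term, which is how the three losses are balanced during training.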


7. RESULT AND DISCUSSION 
7.1. Model training 

We explored a range of techniques to train the model. Initially, we trained the model using a batch size
of 64 and an image size of 640. However, owing to limited GPU memory, a batch size of 6 was selected. We
conducted our training with the SGD optimizer and with the default hyperparameters and anchors. Figure 11
depicts the labeled data at training time.


7.2. Metrics 

Our object detection model has two responsibilities: to locate the bounding box of an object and to
predict the label assigned to that box. Mean average precision is the standard method for assessing the
performance of both tasks performed by an object detection model. When evaluating object detection
models, we must assess not only the model's classification abilities but also its ability to accurately locate
objects. As a result, we require a distinct evaluation metric, known as the mean average precision (mAP). The
fundamental concept is to combine the evaluation of detection and classification abilities. The measure used by
the mean average precision to determine the accuracy of a bounding box is the IoU. The IoU for a
predicted bounding box P and its corresponding ground truth box Q is computed using (3):

$$\text{IoU}(P, Q) = \frac{\text{Area}(P \cap Q)}{\text{Area}(P \cup Q)} \tag{3}$$
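
Equation (3) can be computed directly for axis-aligned boxes. The following Python sketch assumes boxes are given as (x_min, y_min, x_max, y_max) tuples; the function names are illustrative.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def area(box: Box) -> float:
    """Area of an axis-aligned box; degenerate boxes have zero area."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(p: Box, q: Box) -> float:
    """Intersection over union of boxes P and Q as in (3)."""
    inter_w = max(0.0, min(p[2], q[2]) - max(p[0], q[0]))
    inter_h = max(0.0, min(p[3], q[3]) - max(p[1], q[1]))
    intersection = inter_w * inter_h
    union = area(p) + area(q) - intersection
    return intersection / union if union > 0 else 0.0
```

For example, two 2 x 2 boxes offset by one unit in each direction overlap in a 1 x 1 region, giving an IoU of 1/7.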


By using IoU values, we can determine true positives and false
positives. If the IoU value exceeds a certain threshold, we classify the prediction as a true positive. For
instance, if we establish a threshold of 0.5, a prediction with an IoU greater than 0.5 is classified as a true
positive, while those with an IoU less than or equal to 0.5 are considered false positives. To calculate
precision, divide the number of true positives by the total number of positive predictions. This can be
expressed as (4):

$$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} \tag{4}$$




Figure 11. Training data with labeling 


We compute metrics for IoU thresholds ranging from 0.5 to 0.95 in increments of 0.05. The average
precision for a class is then determined by averaging the precision values for that class across all IoU
thresholds. Finally, the mAP for n classes, where $AP_k$ is the average precision of class k, is computed as (5):

$$\text{mAP} = \frac{1}{n}\sum_{k=1}^{n} AP_k \tag{5}$$
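
The averaging described above and in (5) can be sketched in Python as follows. The sketch follows the simplified description given in this section (averaging per-threshold precision values for each class); evaluation toolkits typically compute the average precision from the full precision-recall curve instead. The function names and example values are assumptions.

```python
from typing import List, Sequence


def iou_thresholds() -> List[float]:
    """IoU thresholds from 0.5 to 0.95 in increments of 0.05."""
    return [round(0.5 + 0.05 * i, 2) for i in range(10)]


def average_precision(precisions: Sequence[float]) -> float:
    """AP of one class: mean of its precision values across the IoU thresholds."""
    return sum(precisions) / len(precisions)


def mean_average_precision(ap_per_class: Sequence[float]) -> float:
    """mAP as in (5): mean of AP_k over the n classes."""
    return sum(ap_per_class) / len(ap_per_class)


# Hypothetical per-threshold precision values for two classes.
ap_class_1 = average_precision([1.0] * 5 + [0.8] * 5)
ap_class_2 = average_precision([0.7] * 10)
map_value = mean_average_precision([ap_class_1, ap_class_2])
```

With these assumed values the two classes have AP of 0.9 and 0.7, giving an mAP of 0.8.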


Accuracy is the fraction of all predictions that the model classifies correctly. Recall 
is the fraction of actual positives that the model correctly identifies. The F1 score is the 
harmonic mean of recall and precision. Mathematically, the metrics are specified as (6) to (8): 


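$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{6}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{7}$$

$$F_1 = \frac{2\,(\mathrm{Precision} \times \mathrm{Recall})}{\mathrm{Precision} + \mathrm{Recall}} \tag{8}$$

where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively. 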
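The following short Python sketch computes accuracy, precision, recall, and the F1 score from the four confusion counts. The function name and the dictionary return format are choices made for this example only.

```python
from typing import Dict


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 from raw confusion counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

With `tp=8, tn=5, fp=2, fn=5`, accuracy is 13/20, precision is 8/10, recall is 8/13, and F1 is 16/23.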


8. EVALUATION 

During the training process, we monitored the performance of the primary metrics on the validation set and 
tracked the loss to ensure that we were not overfitting. Let us first examine the loss graphs, which are divided 
into class loss, objectness loss, and box loss. Figure 12 shows that ‘box_loss’, ‘obj_loss’ and ‘class_loss’ 
decrease smoothly up to 290 epochs, but after 290 epochs obj_loss tends to increase, which is a sign of overfitting. We 
therefore stopped training at that point. Table 3 reports the loss values of our model at evaluation time. 
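The stopping decision described above was made by inspecting the loss curves. As an illustration only, the same idea can be automated with a patience-based check on a validation loss series. The `patience` parameter and the function name are hypothetical and do not describe the procedure actually used in this work.

```python
from typing import List, Optional


def first_overfit_epoch(val_losses: List[float], patience: int = 3) -> Optional[int]:
    """Return the epoch index at which training would stop, or None.

    Training stops once the loss has failed to improve on its best
    value for `patience` consecutive epochs (an assumed criterion).
    """
    best = float("inf")
    stale = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return None
```

A larger `patience` tolerates noisy loss curves at the cost of training for longer after the loss has started to rise.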
Figure 12. Box_loss, obj_loss and class_loss graph 


Table 3. Evaluation time performance metrics 


Loss          Train dataset    Val dataset 
Box loss      0.01             0.01 
Obj loss      0.14             0.15 
Class loss    0.009            0.01 


During training, we tracked the mAP over the IoU threshold range of 0.5 to 0.95. Figure 13 shows that, over the epochs, 
the mAP rises as the loss falls, as anticipated. Figure 14 shows the machine-generated document, and 
Figures 15(a) and (b) show a physical handwritten document and the corresponding predicted output. We 
can also see that the predictions for certain classes are ambiguous (bolded text). This problem can 
be reduced by increasing the number of samples in the dataset. 
Figure 13. mAP50 and mAP95 graph 


Figure 14. Simplified predicted output 




Figure 15. Handwritten input and output; (a) handwritten input given to the model and (b) predicted output 
with accuracy 
In the confusion matrix in Figure 16(a), classes 74, 76, 89, 114, and 124 may appear to be misclassified, 
but the class pairs (74, 57), (76, 60), (89, 43), (114, 42), and (124, 119) each refer to the same class. Because of the large 
number of classes, the matrix is rendered as a heatmap for readability. Figure 16(b) indicates that some classes are 
misclassified with low probability (low color intensity), which may be reduced by training for more epochs. 
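One way to keep such duplicate class pairs from appearing as errors is to merge them before the confusion matrix is plotted. The sketch below is a hypothetical post-processing step, not part of the reported pipeline. The pairs are taken from the text, but treating the second index of each pair as the canonical class is an assumption.

```python
from typing import Dict, List

# Duplicate class pairs from the text, as alias -> canonical index.
# Which member of each pair is canonical is an assumption.
ALIASES: Dict[int, int] = {74: 57, 76: 60, 89: 43, 114: 42, 124: 119}


def merge_aliases(matrix: List[List[int]], aliases: Dict[int, int]) -> List[List[int]]:
    """Fold the rows and columns of alias classes into their canonical class."""
    n = len(matrix)
    merged = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Counts of an alias class are added to its canonical class;
            # the alias row and column are left as zeros.
            merged[aliases.get(i, i)][aliases.get(j, j)] += matrix[i][j]
    return merged
```

After merging, counts that previously fell off the diagonal only because of a duplicated label are added to the diagonal entry of the canonical class.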




Figure 16. Matrix; (a) confusion matrix and (b) confusion matrix (zoomed) 


9. CONCLUSION 

In this work, we addressed the task of recognizing Bengali words in handwritten 
documents. To facilitate research in this domain, we developed and made publicly available our own 
dataset, named “Omor-Ekush”. This dataset will aid future studies in the field of 
Bengali word recognition. Additionally, we developed a custom annotation software called HAWAN, 
which streamlines the process of dataset annotation and further enhances the accessibility and usability of the 
dataset. Our experiments demonstrated the effectiveness of the YOLO v5 model trained on the “Omor-Ekush” 
dataset for word recognition in handwritten images. The model performed well as 
measured by the mAP metrics, which indicates that it can accurately and robustly identify Bengali words 
in handwritten documents. Furthermore, the depth and diversity of the “Omor-Ekush” dataset extend its utility 
beyond word recognition to areas such as handwriting analysis and document understanding. Our annotation tool, 
HAWAN, has potential for adaptation to various languages. Together, they provide a foundation for 
future research in handwriting recognition and document analysis. In conclusion, our 
work contributes to both the academic and practical aspects of Bengali word recognition and handwriting 
analysis. We encourage researchers and practitioners to use the “Omor-Ekush” dataset and explore 
HAWAN’s capabilities to advance handwritten text recognition and related fields. In the future, we aim to 
improve Bengali word recognition by expanding the dataset with more diverse data, introducing more classes, 
integrating the model with HAWAN for better annotations, optimizing hyperparameters, and using a more 
powerful GPU for faster training and experimentation. We expect these efforts to improve the accuracy and 
robustness of recognizing Bengali words in handwritten documents. 